Content Processing Services
This document provides comprehensive technical documentation for the content processing services that power intelligent data extraction and transformation across the system. These services are responsible for transforming raw, unstructured data from emails and official university sources into structured, actionable information for notification delivery. The focus areas include:
PlacementService: Extracting structured placement offer data from unstructured emails using Google Gemini LLM integration, with robust classification, extraction, validation, privacy sanitization, and retry mechanisms.
EmailNoticeService: Classifying and extracting structured notices from general email sources, including placement policy updates and non-placement notices.
OfficialPlacementService: Scraping and processing placement data from official university websites.
PlacementStatsCalculatorService: Generating analytics and statistics from processed placement data, enabling insights across branches, companies, and package distributions.
The documentation covers LLM integration patterns, data validation processes, content formatting workflows, and how these services collectively transform raw data into structured, actionable information for notification delivery.
The content processing services are organized within the services layer, with clear separation of concerns and dependency injection support. The services leverage reusable clients for external integrations and centralized configuration management.
This section introduces the four primary content processing services and their responsibilities:
PlacementService: Orchestrates a LangGraph pipeline to classify, extract, validate, sanitize, and display placement offer data from emails using Google Gemini LLM.
EmailNoticeService: Processes general notices from email sources, including placement policy updates, with LLM-based classification and extraction.
OfficialPlacementService: Scrapes official university placement pages to extract structured data about batches, recruiters, and package distributions.
PlacementStatsCalculatorService: Computes comprehensive statistics from placement offers, including branch-wise, company-wise, and package distribution metrics.
Each service implements dependency injection for flexibility and testability, integrates with configuration management, and interacts with the database service for persistence.
The content processing architecture follows a modular, layered design with clear boundaries between services, clients, and core configuration. The services utilize LangGraph for workflow orchestration and Google Gemini for LLM-powered extraction and classification. External integrations are abstracted through dedicated clients, and configuration is centralized via environment variables.
PlacementService Analysis
PlacementService implements a sophisticated LangGraph pipeline to process placement offers from emails. The pipeline consists of four stages: classification, extraction, validation, and privacy sanitization, each with robust error handling and retry logic.
Key implementation patterns:
LangGraph workflow with conditional edges for decision-making
Pydantic models for strong data validation
LLM prompts designed for strict classification and extraction
Privacy sanitization to remove sensitive information
Retry mechanisms for robust extraction
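The staged flow described above can be sketched in plain Python. This is a minimal stand-in for illustration only: the actual service wires these stages as LangGraph nodes with conditional edges and calls Gemini for classification and extraction, while the names here (`OfferState`, `run_pipeline`, the keyword check) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class OfferState:
    raw_email: str
    is_placement: bool = False
    extracted: dict = field(default_factory=dict)
    valid: bool = False
    retries: int = 0

def classify(state: OfferState) -> OfferState:
    # Stand-in for the LLM classifier: a crude keyword check.
    state.is_placement = "offer" in state.raw_email.lower()
    return state

def extract(state: OfferState) -> OfferState:
    # Stand-in for LLM extraction: naively treat the first line as the company.
    first_line = state.raw_email.splitlines()[0]
    state.extracted = {"company": first_line.strip() or None}
    return state

def validate(state: OfferState) -> OfferState:
    state.valid = bool(state.extracted.get("company"))
    return state

def sanitize(state: OfferState) -> OfferState:
    # Privacy step: drop the raw email before the result leaves the pipeline.
    state.raw_email = ""
    return state

def run_pipeline(raw_email: str, max_retries: int = 2):
    state = classify(OfferState(raw_email))
    if not state.is_placement:        # conditional edge: non-placement mail exits early
        return None
    while True:
        state = validate(extract(state))
        if state.valid:
            return sanitize(state)    # happy path: sanitize, then finish
        if state.retries >= max_retries:
            return None               # retry budget exhausted
        state.retries += 1            # conditional edge back to extraction
```

The conditional returns mirror the pipeline's conditional edges: a failed classification rejects immediately, a failed validation loops back to extraction until the retry limit is hit.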
LLM Integration Patterns:
Classification stage uses keyword-based scoring with confidence thresholds
Extraction stage employs structured prompts with strict schema enforcement
Privacy sanitization removes headers, forwarded markers, and sensitive metadata
Retry logic with exponential backoff for validation failures
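The retry behavior in the last point can be sketched as a small wrapper. The function names and the use of `ValueError` as the validation-failure signal are assumptions for illustration; the service's own retry logic lives inside its LangGraph workflow.

```python
import time

def extract_with_retry(extract_fn, payload, max_retries=3, base_delay=1.0):
    """Retry extraction with exponential backoff; give up after max_retries."""
    for attempt in range(max_retries + 1):
        try:
            return extract_fn(payload)
        except ValueError:                           # stand-in for a validation failure
            if attempt == max_retries:
                raise                                # bounded retries avoid infinite loops
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```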
Data Validation Processes:
Pydantic validation ensures data integrity and type safety
Confidence scoring prevents false positives
Package extraction follows strict conversion rules (normalizing figures to LPA, with monthly amounts annualized)
Role assignment defaults and enhancement logic
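The conversion rules can be sketched as a normalizer. The exact patterns the service recognizes are not specified here, so the regexes below are assumptions; they illustrate the two documented cases, LPA values passing through and monthly figures annualized.

```python
import re

def to_annual_lpa(text: str):
    """Best-effort normalization of a salary string to annual LPA.

    Rules sketched from the documented conversion logic (assumed):
    - "12 LPA"           -> 12.0
    - "50,000 per month" -> 50_000 * 12 / 1e5 = 6.0 LPA
    """
    cleaned = text.replace(",", "")
    m = re.search(r"(\d+(?:\.\d+)?)\s*lpa", cleaned, re.I)
    if m:
        return float(m.group(1))
    m = re.search(r"(\d+(?:\.\d+)?)\s*(?:per\s*month|/month)", cleaned, re.I)
    if m:
        return round(float(m.group(1)) * 12 / 100_000, 2)  # rupees/month -> LPA
    return None
```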
Content Formatting Workflows:
Privacy-first approach strips sensitive information
Structured JSON output with standardized schema
Enhanced metadata including sender, time_sent, and rejection reasons
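The privacy-first stripping step can be sketched with a few regexes. The pattern list below is illustrative, not the service's actual list; it shows the documented idea of removing headers, forwarded-message markers, and addresses before content is stored.

```python
import re

# Common header and forwarded-email markers (illustrative set).
_STRIP_PATTERNS = [
    r"^-{2,}\s*Forwarded message\s*-{2,}.*$",
    r"^(From|To|Cc|Date|Subject):.*$",
    r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",   # email addresses
]

def sanitize_content(body: str) -> str:
    """Strip headers, forwarded markers, and addresses before storage."""
    for pat in _STRIP_PATTERNS:
        body = re.sub(pat, "", body, flags=re.MULTILINE | re.IGNORECASE)
    return "\n".join(line for line in body.splitlines() if line.strip())
```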
EmailNoticeService Analysis
EmailNoticeService processes general notices from email sources using a LangGraph pipeline with LLM-based classification and extraction. The service distinguishes between placement notices, policy updates, and general announcements.
Key implementation patterns:
Unified LangGraph workflow for notice processing
LLM-based classification eliminating manual keyword filtering
Specialized handling for placement policy updates
Notice formatting and database persistence
Notice Types and Extraction:
Announcement, job posting, shortlisting, update, webinar, reminder, hackathon, internship_noc
Structured extraction with type-specific fields and validation
Privacy-preserving content extraction
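The type-specific validation step can be sketched against the closed set of notice types listed above. The payload field names (`type`, `title`) are assumptions for illustration; the service itself performs this with Pydantic models.

```python
# The documented notice categories form a closed set that extraction
# results are checked against before persistence.
NOTICE_TYPES = {
    "announcement", "job posting", "shortlisting", "update",
    "webinar", "reminder", "hackathon", "internship_noc",
}

def validate_notice(payload: dict) -> dict:
    """Reject extraction results whose type falls outside the known set."""
    notice_type = payload.get("type", "").lower()
    if notice_type not in NOTICE_TYPES:
        raise ValueError(f"unknown notice type: {notice_type!r}")
    if not payload.get("title"):
        raise ValueError("notice requires a title")
    return payload
```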
Policy Update Processing:
Dedicated extraction for placement policy documents
Advanced LLM prompting for policy content
Integration with PlacementPolicyService for storage
OfficialPlacementService Analysis
OfficialPlacementService scrapes official university placement pages to extract structured data about batches, recruiters, and package distributions. The service focuses on parsing HTML content and converting it into standardized data models.
Key implementation patterns:
HTML parsing with BeautifulSoup for robust content extraction
Data model validation using Pydantic
Batch processing across multiple tabs and content sections
Image URL normalization and metadata extraction
Data Extraction Strategies:
Targeted selectors for main heading, introductory text, and recruiter logos
Batch navigation parsing with active state detection
Package distribution table extraction with category mapping
Pointer list extraction for placement achievements
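The recruiter-logo extraction and image URL normalization steps can be sketched with the standard library. This is a stdlib stand-in for the service's BeautifulSoup selectors, and the page URL and markup below are hypothetical; the point is the `urljoin` normalization of relative `src` values.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class RecruiterLogoParser(HTMLParser):
    """Collect <img> sources and normalize relative URLs against the page URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.logo_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Relative paths become absolute; absolute URLs pass through.
                self.logo_urls.append(urljoin(self.base_url, src))

parser = RecruiterLogoParser("https://www.example.edu/placements/")
parser.feed('<div class="recruiters"><img src="/img/acme.png"></div>')
```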
PlacementStatsCalculatorService Analysis
PlacementStatsCalculatorService computes comprehensive statistics from processed placement data, enabling insights across branches, companies, and package distributions. The service implements sophisticated aggregation logic with configurable enrollment ranges and student counts.
Key implementation patterns:
Branch range resolution using enrollment number patterns
Package calculation with highest-per-student logic
Multi-dimensional statistics computation (branch-wise, company-wise)
Filterable analytics with CSV export capability
Statistical Computation Logic:
Enrollment range mapping for branch identification
Highest package per unique student for fair statistics
Placement percentage calculation using eligible student pools
Multi-level aggregation with filtering capabilities
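The first three points can be sketched together. The enrollment-prefix map, record fields, and branch codes below are illustrative placeholders; the real ranges and eligible-student counts come from configuration.

```python
# Hypothetical enrollment-prefix -> branch map (first 3 digits of enrollment no.).
BRANCH_RANGES = {"211": "CSE", "213": "ECE"}

def branch_of(enrollment: str) -> str:
    return BRANCH_RANGES.get(enrollment[:3], "Other")

def branch_stats(offers, eligible_counts):
    # Keep only the highest package per unique student so multi-offer
    # students do not inflate the statistics.
    best = {}
    for o in offers:
        e = o["enrollment"]
        if e not in best or o["package_lpa"] > best[e]["package_lpa"]:
            best[e] = o
    stats = {}
    for o in best.values():
        b = branch_of(o["enrollment"])
        s = stats.setdefault(b, {"placed": 0, "highest": 0.0})
        s["placed"] += 1
        s["highest"] = max(s["highest"], o["package_lpa"])
    # Placement percentage is computed against the eligible student pool.
    for b, s in stats.items():
        s["percent"] = round(100 * s["placed"] / eligible_counts.get(b, s["placed"]), 1)
    return stats
```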
Filtering and Export Capabilities:
Branch exclusion for JUIT, Other, MTech categories
Company, role, location, and package range filters
CSV export with standardized column headers
Search query support across multiple fields
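The filter-then-export flow can be sketched as follows. The column names and filter parameters are assumptions for illustration; the service exposes a richer filter set (role, location, search queries) than shown here.

```python
import csv
import io

def export_offers_csv(offers, company=None, min_lpa=None):
    """Apply optional filters, then write rows with standardized headers."""
    rows = [
        o for o in offers
        if (company is None or o["company"] == company)
        and (min_lpa is None or o["package_lpa"] >= min_lpa)
    ]
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["enrollment", "company", "role", "package_lpa"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```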
The content processing services exhibit clear dependency relationships and integration points:
Dependency Coupling and Cohesion:
High cohesion within each service around specific responsibilities
Low coupling through dependency injection and interface abstraction
Centralized configuration management reducing cross-service coupling
Clear separation between data extraction and formatting concerns
Integration Points:
Google Gemini API for LLM-powered processing
IMAP protocol for email ingestion
MongoDB for persistent storage
Web scraping for official placement data
The content processing services implement several performance optimizations:
LangGraph workflow optimization: Parallel processing of independent nodes, minimal state copying, and efficient conditional routing
LLM cost optimization: Temperature control set to zero for deterministic responses, structured prompts to reduce token usage
Database efficiency: Bulk operations where possible, indexed lookups for notice existence checks, and selective field projections
Memory management: Streaming HTML parsing for large documents, lazy evaluation of expensive computations
Retry strategies: Exponential backoff for LLM extraction failures, avoiding infinite loops through retry limits
Best practices for deployment:
Environment variable caching for configuration access
Connection pooling for database operations
Asynchronous processing for independent notice processing
Monitoring and logging for performance metrics
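The environment-variable caching practice can be sketched with `functools.lru_cache`; the variable name used in the example is hypothetical.

```python
import functools
import os

@functools.lru_cache(maxsize=None)
def get_setting(name: str, default: str = "") -> str:
    """Read an environment variable once and serve the cached value on hot paths."""
    return os.environ.get(name, default)
```

The trade-off of this design is that changes to the environment after the first read are not observed, which is usually acceptable for deploy-time configuration.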
Common issues and their resolutions:
LLM Integration Issues:
API key configuration errors: Verify GOOGLE_API_KEY environment variable
Model availability problems: Check Gemini model quotas and rate limits
Prompt formatting errors: Review structured prompt templates for schema compliance
Email Processing Issues:
Authentication failures: Verify PLCAMENT_EMAIL and PLCAMENT_APP_PASSWORD
IMAP connectivity problems: Check network connectivity and firewall settings
Forwarded email parsing: Use extract_forwarded_date and extract_forwarded_sender utilities
Data Validation Errors:
Pydantic validation failures: Review schema requirements and data types
Package extraction inconsistencies: Verify LPA normalization and monthly-to-annual conversion rules
Branch resolution mismatches: Check enrollment range configurations
Database Connectivity:
Connection string issues: Validate MONGO_CONNECTION_STR environment variable
Collection initialization failures: Ensure database migrations are complete
Write operation errors: Check write permissions and document size limits
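Several of the issues above reduce to missing configuration, so a fail-fast startup check can help. The helper below is a sketch, not part of the services; it covers the two environment variables named in this checklist.

```python
import os

# Environment variables named in the troubleshooting checklist above.
REQUIRED_VARS = ["GOOGLE_API_KEY", "MONGO_CONNECTION_STR"]

def check_config(env=os.environ):
    """Fail fast with a single message naming every missing variable."""
    missing = [v for v in REQUIRED_VARS if not env.get(v)]
    if missing:
        raise RuntimeError(
            f"missing environment variables: {', '.join(missing)}"
        )
```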
The content processing services provide a robust, scalable foundation for transforming raw data into structured, actionable information. Through careful separation of concerns, LLM-powered intelligence, and comprehensive validation, these services enable reliable notification delivery across placement offers, general notices, and official placement data. The modular architecture supports easy maintenance, testing, and extension for future requirements.
The services demonstrate best practices in:
LLM integration patterns with structured prompts and validation
Data validation using Pydantic models and confidence scoring
Privacy-preserving content processing and sanitization
Comprehensive analytics and reporting capabilities
Dependency injection and configuration management
These components work together to create a comprehensive content processing pipeline that reliably transforms diverse data sources into consistent, useful information for notification systems.